This is the Capstone project of the Google Data Analytics Professional Program on Coursera.
In the hypothetical scenario of the case study I am a junior data analyst working in the marketing department of the fictional Chicago-based bike-sharing company Cyclistic. The company is looking to shift from broad marketing focused on general brand recognition to a more concentrated campaign to convert casual riders into members — those paying for annual subscriptions, as opposed to one-day passes or single rides. The transition in strategy is based on an analysis by the financial department which concluded that annual members are more profitable than casual riders.
The CEO Lily Moreno has tasked me personally with doing an analysis of the differences in the use of the service between casual and annual members. Its insights and recommendations will then be used to design an effective advertising campaign.
In this project I will look at the differences between members and casual riders across four dimensions: months, weekdays, hours, and ride durations.
The hypothesis to be tested is that members tend to use Cyclistic as a means of transport to commute to and from work, and that casual riders are more recreational in their purposes.
Although the Coursera certificate teaches the R programming language and suggests (but doesn't require) its use in the project, I chose Python, as my knowledge of it is comparatively more advanced.
I tried to make visualizations as user-friendly and interactive as possible with the help of the Plotly library.
I was given access to publicly available quarterly datasets of the real Chicago-based company Divvy (fictionalized as Cyclistic in the Coursera program) and asked to only consider the last full year of the available data, Q2-2019 to Q1-2020.
First, importing libraries for processing, analyzing and visualizing data.
# Importing libraries
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
# Setting aesthetic parameters for plots
plt.rcParams['figure.figsize'] = (16, 8)
pio.templates.default = 'presentation'
sns.set_style('whitegrid')
# Turning off scientific notation in displaying numbers
pd.set_option('display.float_format', '{:.5f}'.format)
# Turning off warnings for chained assignments
pd.options.mode.chained_assignment = None
# Importing data and declaring variables corresponding to different quarters
q2_2019 = pd.read_csv('divvy_data/Divvy_Trips_2019_Q2.csv')
q3_2019 = pd.read_csv('divvy_data/Divvy_Trips_2019_Q3.csv')
q4_2019 = pd.read_csv('divvy_data/Divvy_Trips_2019_Q4.csv')
q1_2020 = pd.read_csv('divvy_data/Divvy_Trips_2020_Q1.csv')
Let's take a look at the column names in each of the imported dataframes.
# List comprehension for printing the column names of each dataframe
[data.columns for data in [q2_2019, q3_2019, q4_2019, q1_2020]]
[Index(['01 - Rental Details Rental ID', '01 - Rental Details Local Start Time',
'01 - Rental Details Local End Time', '01 - Rental Details Bike ID',
'01 - Rental Details Duration In Seconds Uncapped',
'03 - Rental Start Station ID', '03 - Rental Start Station Name',
'02 - Rental End Station ID', '02 - Rental End Station Name',
'User Type', 'Member Gender',
'05 - Member Details Member Birthday Year'],
dtype='object'),
Index(['trip_id', 'start_time', 'end_time', 'bikeid', 'tripduration',
'from_station_id', 'from_station_name', 'to_station_id',
'to_station_name', 'usertype', 'gender', 'birthyear'],
dtype='object'),
Index(['trip_id', 'start_time', 'end_time', 'bikeid', 'tripduration',
'from_station_id', 'from_station_name', 'to_station_id',
'to_station_name', 'usertype', 'gender', 'birthyear'],
dtype='object'),
Index(['ride_id', 'rideable_type', 'started_at', 'ended_at',
'start_station_name', 'start_station_id', 'end_station_name',
'end_station_id', 'start_lat', 'start_lng', 'end_lat', 'end_lng',
'member_casual'],
dtype='object')]
Fortunately, there's a lot of data. Unfortunately, column names aren't standardized and some quarters have more columns than others.
To look at trends across the year, let's only include features that appear in all four datasets and exclude information like geography, gender and birth year that only appears in some of the tables.
Had the data been in the wide format with the same observations across the four datasets, an inner merge in pandas would be appropriate. In this situation, however, it's easier to standardize column names and order and simply concatenate the datasets.
Let's set the q1_2020 dataset as the standard, as it's the most recent and most compliant with the snake_case naming conventions.
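As an aside, the shared-column set could also be computed programmatically rather than listed by hand. A minimal sketch, with toy frames standing in for the quarterly datasets:

```python
import pandas as pd

# Toy frames standing in for the quarterly datasets
a = pd.DataFrame(columns=['ride_id', 'started_at', 'gender'])
b = pd.DataFrame(columns=['ride_id', 'started_at', 'birthyear'])
# Columns present in both, preserving the order of the first frame
common = [c for c in a.columns if c in b.columns]
print(common)  # ['ride_id', 'started_at']
```

This scales to any number of frames by chaining the membership test, though with only four datasets the explicit rename dictionaries below are arguably more readable.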
# Looking at a sample of the q2_2019 dataset
q2_2019.sample(2)
| | 01 - Rental Details Rental ID | 01 - Rental Details Local Start Time | 01 - Rental Details Local End Time | 01 - Rental Details Bike ID | 01 - Rental Details Duration In Seconds Uncapped | 03 - Rental Start Station ID | 03 - Rental Start Station Name | 02 - Rental End Station ID | 02 - Rental End Station Name | User Type | Member Gender | 05 - Member Details Member Birthday Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1063773 | 23428148 | 2019-06-28 15:05:05 | 2019-06-28 15:20:21 | 3978 | 916.0 | 6 | Dusable Harbor | 18 | Wacker Dr & Washington St | Subscriber | Male | 1978.00000 |
| 827757 | 23150760 | 2019-06-13 16:46:26 | 2019-06-13 17:15:52 | 1795 | 1,766.0 | 251 | Clarendon Ave & Leland Ave | 35 | Streeter Dr & Grand Ave | Subscriber | Female | 1989.00000 |
# Renaming the q2_2019 columns in accordance with the 2020 standard
q2_2019 = \
q2_2019.rename(columns={
'01 - Rental Details Rental ID': 'ride_id',
'01 - Rental Details Local Start Time': 'started_at',
'01 - Rental Details Local End Time': 'ended_at',
'03 - Rental Start Station ID': 'start_station_id',
'03 - Rental Start Station Name': 'start_station_name',
'02 - Rental End Station ID': 'end_station_id',
'02 - Rental End Station Name': 'end_station_name',
'User Type': 'member_casual'
})
# Rewriting the dataframe so as to only include columns that also
# exist in the q1_2020 dataframe
q2_2019 = \
q2_2019.loc[:, q2_2019.columns.isin(q1_2020.columns)]
# Validating the result
q2_2019.sample(5)
| | ride_id | started_at | ended_at | start_station_id | start_station_name | end_station_id | end_station_name | member_casual |
|---|---|---|---|---|---|---|---|---|
| 441628 | 22699127 | 2019-05-17 08:53:21 | 2019-05-17 09:06:18 | 53 | Wells St & Huron St | 25 | Michigan Ave & Pearson St | Customer |
| 458638 | 22719524 | 2019-05-18 23:42:00 | 2019-05-18 23:46:35 | 300 | Broadway & Barry Ave | 13 | Wilton Ave & Diversey Pkwy | Subscriber |
| 73628 | 22264727 | 2019-04-09 11:01:55 | 2019-04-09 11:31:02 | 134 | Peoria St & Jackson Blvd | 85 | Michigan Ave & Oak St | Subscriber |
| 1028828 | 23386204 | 2019-06-26 17:19:22 | 2019-06-26 17:32:30 | 100 | Orleans St & Merchandise Mart Plaza | 175 | Wells St & Polk St | Customer |
| 695001 | 22995954 | 2019-06-05 09:48:28 | 2019-06-05 10:11:59 | 52 | Michigan Ave & Lake St | 91 | Clinton St & Washington Blvd | Customer |
# Renaming the q3_2019 columns in accordance with the 2020 standard
q3_2019 = \
q3_2019.rename(columns={
'trip_id': 'ride_id',
'start_time': 'started_at',
'end_time': 'ended_at',
'from_station_id': 'start_station_id',
'from_station_name': 'start_station_name',
'to_station_id': 'end_station_id',
'to_station_name': 'end_station_name',
'usertype': 'member_casual'
})
# Rewriting the dataframe so as to only include columns that also
# exist in the q1_2020 dataframe
q3_2019 = \
q3_2019.loc[:, q3_2019.columns.isin(q1_2020.columns)]
# Validating the result
q3_2019.sample(5)
| | ride_id | started_at | ended_at | start_station_id | start_station_name | end_station_id | end_station_name | member_casual |
|---|---|---|---|---|---|---|---|---|
| 1424754 | 24999563 | 2019-09-17 08:38:32 | 2019-09-17 08:50:24 | 96 | Desplaines St & Randolph St | 181 | LaSalle St & Illinois St | Subscriber |
| 984298 | 24539744 | 2019-08-23 07:37:59 | 2019-08-23 07:52:23 | 54 | Ogden Ave & Chicago Ave | 287 | Franklin St & Monroe St | Subscriber |
| 1069449 | 24629068 | 2019-08-27 19:05:13 | 2019-08-27 19:40:40 | 284 | Michigan Ave & Jackson Blvd | 195 | Columbus Dr & Randolph St | Customer |
| 1245296 | 24812719 | 2019-09-06 16:17:04 | 2019-09-06 16:31:36 | 195 | Columbus Dr & Randolph St | 338 | Calumet Ave & 18th St | Subscriber |
| 93242 | 23587558 | 2019-07-06 13:02:53 | 2019-07-06 13:22:34 | 324 | Stockton Dr & Wrightwood Ave | 268 | Lake Shore Dr & North Blvd | Subscriber |
# Renaming the q4_2019 columns in accordance with the 2020 standard
q4_2019 = \
q4_2019.rename(columns={
'trip_id': 'ride_id',
'start_time': 'started_at',
'end_time': 'ended_at',
'from_station_id': 'start_station_id',
'from_station_name': 'start_station_name',
'to_station_id': 'end_station_id',
'to_station_name': 'end_station_name',
'usertype': 'member_casual'
})
# Rewriting the dataframe so as to only include columns that also
# exist in the q1_2020 dataframe
q4_2019 = \
q4_2019.loc[:, q4_2019.columns.isin(q1_2020.columns)]
# Validating the result
q4_2019.sample(5)
| | ride_id | started_at | ended_at | start_station_id | start_station_name | end_station_id | end_station_name | member_casual |
|---|---|---|---|---|---|---|---|---|
| 268260 | 25502923 | 2019-10-20 18:46:43 | 2019-10-20 18:59:37 | 57 | Clinton St & Roosevelt Rd | 21 | Aberdeen St & Jackson Blvd | Customer |
| 264234 | 25498730 | 2019-10-20 13:50:14 | 2019-10-20 13:55:31 | 153 | Southport Ave & Wellington Ave | 226 | Racine Ave & Belmont Ave | Subscriber |
| 577900 | 25830311 | 2019-12-05 14:30:50 | 2019-12-05 14:35:23 | 620 | Orleans St & Chestnut St (NEXT Apts) | 359 | Larrabee St & Division St | Subscriber |
| 486252 | 25732882 | 2019-11-19 13:55:22 | 2019-11-19 14:06:03 | 98 | LaSalle St & Washington St | 71 | Morgan St & Lake St | Subscriber |
| 18874 | 25243406 | 2019-10-02 06:16:05 | 2019-10-02 06:53:34 | 324 | Stockton Dr & Wrightwood Ave | 375 | Sacramento Blvd & Franklin Blvd | Subscriber |
# Now doing the same for the 2020 dataset to exclude data
# not present in previous datasets
q1_2020 = \
q1_2020.loc[:, q1_2020.columns.isin(q4_2019.columns)]
# Validating the result
q1_2020.sample(5)
| | ride_id | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | member_casual |
|---|---|---|---|---|---|---|---|---|
| 379338 | 0376DF07868FADA6 | 2020-03-23 19:10:00 | 2020-03-23 19:20:16 | Larrabee St & Division St | 359 | Wells St & Hubbard St | 212.00000 | member |
| 101024 | 1E22A26BBF936DAF | 2020-01-05 13:55:57 | 2020-01-05 14:26:26 | Streeter Dr & Grand Ave | 35 | Theater on the Lake | 177.00000 | casual |
| 14176 | B56BC954825AB21D | 2020-01-26 21:24:19 | 2020-01-26 21:31:02 | Dearborn St & Monroe St | 49 | Dearborn St & Erie St | 110.00000 | member |
| 298096 | 277E70C8F5E5E001 | 2020-03-12 06:53:50 | 2020-03-12 07:00:34 | Canal St & Adams St | 192 | Michigan Ave & Madison St | 197.00000 | member |
| 98637 | C18C6AB988A2F753 | 2020-01-03 17:42:08 | 2020-01-03 17:44:54 | Wacker Dr & Washington St | 18 | Clinton St & Lake St | 66.00000 | member |
# Changing the order of the columns in q1_2020 to conform to the rest
q1_2020 = \
q1_2020.loc[:, q4_2019.columns]
# Validating the result
q1_2020.sample(5)
| | ride_id | started_at | ended_at | start_station_id | start_station_name | end_station_id | end_station_name | member_casual |
|---|---|---|---|---|---|---|---|---|
| 273376 | F55F68EA8D8AAF03 | 2020-02-19 16:20:53 | 2020-02-19 16:24:12 | 283 | LaSalle St & Jackson Blvd | 284.00000 | Michigan Ave & Jackson Blvd | member |
| 308979 | D4E7959A67DB44E6 | 2020-03-25 12:43:34 | 2020-03-25 12:50:01 | 36 | Franklin St & Jackson Blvd | 321.00000 | Wabash Ave & 9th St | member |
| 253257 | 03FC48C474DEAD82 | 2020-02-02 15:55:33 | 2020-02-02 15:57:36 | 59 | Wabash Ave & Roosevelt Rd | 321.00000 | Wabash Ave & 9th St | member |
| 193379 | 714D92EB0145A71A | 2020-02-25 09:17:48 | 2020-02-25 09:21:23 | 283 | LaSalle St & Jackson Blvd | 81.00000 | Daley Center Plaza | member |
| 118126 | A70D7DE47E92DC27 | 2020-01-07 14:43:35 | 2020-01-07 14:51:31 | 423 | University Ave & 57th St | 419.00000 | Lake Park Ave & 53rd St | member |
Let's check whether the dataset columns now conform to a single standard by comparing them pairwise in a chain.
# The all() function returns True only if all boolean values
# in the array are True
all(q2_2019.columns == q3_2019.columns) \
and all(q3_2019.columns == q4_2019.columns) \
and all(q4_2019.columns == q1_2020.columns)
True
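One caveat worth noting: comparing two Index objects with `==` raises a ValueError when their lengths differ, so `Index.equals` is the safer check. A small sketch with toy indexes:

```python
import pandas as pd

# Toy indexes: equals() handles a length mismatch gracefully,
# whereas cols_a == cols_c would raise a ValueError
cols_a = pd.Index(['ride_id', 'started_at'])
cols_b = pd.Index(['ride_id', 'started_at'])
cols_c = pd.Index(['ride_id'])
print(cols_a.equals(cols_b))  # True
print(cols_a.equals(cols_c))  # False
```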
Great, now we can concatenate data into a single dataframe and go on to process it.
# Concatenating data into a single dataframe, df
df = pd.concat([q2_2019, q3_2019, q4_2019, q1_2020])
# Validating the result
df.sample(5)
| | ride_id | started_at | ended_at | start_station_id | start_station_name | end_station_id | end_station_name | member_casual |
|---|---|---|---|---|---|---|---|---|
| 1508604 | 25086316 | 2019-09-21 15:52:50 | 2019-09-21 15:58:05 | 199 | Wabash Ave & Grand Ave | 31.00000 | Franklin St & Chicago Ave (Temp) | Subscriber |
| 5984 | 23486723 | 2019-07-01 12:18:42 | 2019-07-01 12:45:41 | 623 | Michigan Ave & 8th St | 35.00000 | Streeter Dr & Grand Ave | Customer |
| 71467 | 25298452 | 2019-10-06 10:59:27 | 2019-10-06 11:05:34 | 111 | Sedgwick St & Huron St | 110.00000 | Dearborn St & Erie St | Customer |
| 672746 | 22969870 | 2019-06-03 18:27:34 | 2019-06-03 18:31:32 | 260 | Kedzie Ave & Milwaukee Ave | 501.00000 | Richmond St & Diversey Ave | Subscriber |
| 190985 | 23704301 | 2019-07-12 07:36:11 | 2019-07-12 08:00:57 | 334 | Lake Shore Dr & Belmont Ave | 464.00000 | Damen Ave & Foster Ave | Customer |
We can now examine the quality of the available data more closely.
# Printing dataframe's general information
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3879822 entries, 0 to 426886
Data columns (total 8 columns):
 #   Column              Dtype
---  ------              -----
 0   ride_id             object
 1   started_at          object
 2   ended_at            object
 3   start_station_id    int64
 4   start_station_name  object
 5   end_station_id      float64
 6   end_station_name    object
 7   member_casual       object
dtypes: float64(1), int64(1), object(6)
memory usage: 266.4+ MB
The resulting dataframe has an impressive 3.9 million entries for individual rides.
# Counting the total amount of duplicates
df.duplicated().sum()
0
Great, proceeding to missing values.
# Showing the amount of missing values by column
df.isna().sum()
ride_id               0
started_at            0
ended_at              0
start_station_id      0
start_station_name    0
end_station_id        1
end_station_name      1
member_casual         0
dtype: int64
Luckily, the gathered data has just two missing values. Let's see the rows that contain them.
# Seeing the rows with the missing values
df.query('end_station_id.isna() \
or end_station_name.isna()')
| | ride_id | started_at | ended_at | start_station_id | start_station_name | end_station_id | end_station_name | member_casual |
|---|---|---|---|---|---|---|---|---|
| 414426 | 157EAA4C4A3C8D36 | 2020-03-16 11:23:36 | 2020-03-16 11:23:24 | 675 | HQ QR | NaN | NaN | casual |
It turns out that the two missing values are within the same row.
Deleting it won't skew analysis of trends but will enable us to manipulate data more efficiently.
# Deleting rows with missing values
df = df[df.end_station_id.notnull()]
# Validating the result
df.isna().sum()
ride_id               0
started_at            0
ended_at              0
start_station_id      0
start_station_name    0
end_station_id        0
end_station_name      0
member_casual         0
dtype: int64
With missing values out of the way we can now examine data types and change them if necessary.
Let's take a look at the source data types to see if they are optimal for data analysis.
# Printing column data types
df.dtypes
ride_id                object
started_at             object
ended_at               object
start_station_id        int64
start_station_name     object
end_station_id        float64
end_station_name       object
member_casual          object
dtype: object
The most important flaw is the storage of time data in columns started_at and ended_at as object/strings.
It's useful to change it to the datetime format so as to be able to sort time data, as well as extract month, weekday and hour values.
# Formatting time and date values from string to datetime
df.started_at = pd.to_datetime(df.started_at, format='%Y-%m-%d %H:%M:%S')
df.ended_at = pd.to_datetime(df.ended_at, format='%Y-%m-%d %H:%M:%S')
Next, it's good practice to convert columns that store integer values as decimals (float64) to the corresponding integer format, int64.
df['end_station_id'].sample(5)
1434600     94.00000
729622     268.00000
135345      40.00000
863457     420.00000
120870     174.00000
Name: end_station_id, dtype: float64
The column end_station_id is a good example.
# Converting the column from float to integer numbers
df['end_station_id'] = df['end_station_id'].astype('int')
This is more optional, but I will change the member_casual column from string to a boolean True/False format.
First, let's rename the column to reflect the coming type change.
# Renaming the 'member_casual' column
df = df.rename(columns={
'member_casual': 'is_member'
})
Let's now see distinct values the column contains.
# Seeing different values in the 'is_member' column
df['is_member'].value_counts()
Subscriber    2595461
Customer       857474
member         378407
casual          48479
Name: is_member, dtype: int64
It makes sense to reduce these to two values, as the four labels correspond to just two categories: member and casual rider.
# Standardizing different member/non-member classifications as 1 and 0
df.loc[df['is_member'].isin(['Subscriber', 'member']), 'is_member'] = 1
df.loc[df['is_member'].isin(['Customer', 'casual']), 'is_member'] = 0
# Transforming the column to boolean type
df['is_member'] = df['is_member'].astype('bool')
# Validating the transformation
df['is_member'].value_counts(normalize=True)
True     0.76650
False    0.23350
Name: is_member, dtype: float64
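As a side note, the same mapping could be done in a single step with `Series.isin`, skipping the intermediate 1/0 assignments. A sketch on the four raw labels:

```python
import pandas as pd

# The four raw labels, mapped to booleans in one step
labels = pd.Series(['Subscriber', 'Customer', 'member', 'casual'])
is_member = labels.isin(['Subscriber', 'member'])
print(is_member.tolist())  # [True, False, True, False]
```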
To perform the analysis, we need to create four new columns: ride duration, weekday, month, and hour.
First, let's compute the duration of each ride by subtracting its start timestamp from its end timestamp.
# Declaring the duration column as the difference
# between the columns ended_at and started_at
df['duration'] = df['ended_at'] - df['started_at']
Converting the format of the column to integers and renaming it to duration_sec to indicate the unit:
# Extracting the total number of seconds from the datetime data type
df['duration'] = df['duration'].dt.total_seconds().astype(int)
# Renaming the column to indicate the unit
df = df.rename(columns={
'duration': 'duration_sec'
})
# Validating the result
df.duration_sec.sample(5)
663073      481
1303879     338
704609      785
702502      673
769829     1446
Name: duration_sec, dtype: int64
For further visualization and analysis it is also useful to create a separate column duration_min that will contain rounded numbers of minutes.
# Extracting and rounding the total number of minutes from the
# duration_sec column
df['duration_min'] = round(df['duration_sec']/60).astype('int')
Creating a separate column with weekday numbers 0 to 6:
# Extracting the weekday number from the started_at column
# Where 0 is Monday and 6 is Sunday
df['weekday'] = pd.DatetimeIndex(df['started_at']).weekday
# Validating the result
df.weekday.sample(5)
633112     6
225632     3
103038     5
1470869    3
1030407    2
Name: weekday, dtype: int64
Creating a separate column for month numbers from 1 to 12:
# Extracting the month number where 1 is January and 12 is December
df['month'] = pd.DatetimeIndex(df['started_at']).month
# Validating the result
df.month.sample(5)
656172      6
1321978     9
321037     10
571457      5
393976      3
Name: month, dtype: int64
Lastly, creating one such column for the hour of the ride:
# Extracting the hour from the started_at column
df['hour'] = pd.DatetimeIndex(df['started_at']).hour
# Validating the result
df.hour.sample(5)
357787     12
362808     19
596405     19
192050      7
501024     11
Name: hour, dtype: int64
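The three extractions above can also be written with the `.dt` accessor, which is the more idiomatic route in modern pandas. A sketch on a toy Series of hypothetical timestamps:

```python
import pandas as pd

# Toy timestamps (hypothetical rides); .dt exposes weekday, month and hour
s = pd.to_datetime(pd.Series(['2019-07-06 13:02:53', '2020-03-23 19:10:00']))
print(s.dt.weekday.tolist())  # [5, 0]: a Saturday and a Monday
print(s.dt.month.tolist())    # [7, 3]
print(s.dt.hour.tolist())     # [13, 19]
```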
To check for consistency and quality of the data we can compute the minimum and maximum values of ride durations:
print(f'''Maximum ride duration: \
{round(df["duration_min"].max()/60)} hours,
Minimum ride duration: \
{df["duration_min"].min()} min''')
Maximum ride duration: 2608 hours, Minimum ride duration: -56 min
Assuming that time travel isn't available even to the most ingenious Cyclistic users, and that bike sharing isn't the first choice of ultramarathon cyclists, the data is imperfect and needs a trim.
Any cutoff for legitimate ride duration is bound to be at least somewhat arbitrary, as I wasn't able to consult a domain expert within the company. For the purposes of this analysis let's conservatively assume that a ride shorter than 40 seconds or longer than 8 hours was not a legitimate use of the service.
The former may have been a mistake in bike selection; the latter, an abandoned bike.
To see if rows with outliers like this comprise a sizeable chunk of the dataset, it's useful to compute the percentiles of these values — the percentage of values that fall below the cutoff.
# Sorting ride durations to find percentiles of chosen cutoff values
duration_sorted = df['duration_sec'].sort_values()
# Finding the percentile of each cutoff value by locating its
# position in the sorted Series, dividing that index by the
# length of the Series, multiplying by 100 and rounding to two decimals
print(f'''40 seconds is at the \
{round(duration_sorted.searchsorted(40)
/len(duration_sorted)*100, 2)}th percentile of values in the \
duration_sec column.
8 hours is at the \
{round(duration_sorted.searchsorted(8*60*60)
/len(duration_sorted)*100, 2)}th percentile.''')
40 seconds is at the 0.18th percentile of values in the duration_sec column.
8 hours is at the 99.86th percentile.
Luckily, duration outliers are very infrequent.
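As a sanity check on the searchsorted approach: the percentile of a cutoff is simply the share of values below it, which a boolean mask gives directly. A sketch with made-up durations:

```python
import pandas as pd

# Made-up durations in seconds; two of five fall below the 40-second cutoff
durations = pd.Series([10, 35, 45, 7200, 30000])
pct_below_40 = (durations < 40).mean() * 100
print(round(pct_below_40, 2))  # 40.0
```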
Creating a separate dataframe without outliers makes it easy to calculate the percentage of data lost in the trimming:
# Creating a dataframe without rows with implausible ride durations
df_cleaned = \
df[
(df['duration_sec'] >= 40)
& (df['duration_sec'] <= 8*60*60)
]
# Finding the share of rows deleted from the dataset, rounding to the third decimal
# and expressing as a percentage
print(f'''\
trimming {round((df.shape[0] - df_cleaned.shape[0])
/df.shape[0], 3)*100}\
% of data''')
trimming 0.3% of data
This is more than acceptable, so we can now proceed with the cleaned dataframe.
# Overwriting the original dataframe variable
# with the cleaned data
df = df_cleaned
To check data for quality in another way, we can count the number of unique stations and compare it with the number given in the case study description, 692.
# Computing the number of unique station id's
print(f'Number of unique stations: {df.start_station_id.nunique()}')
Number of unique stations: 618
This is within the credible range, and the difference is probably explained by some stations being out of order in the period studied, or not yet being open to the public.
Lastly, we can check if ride id's in our dataframe are truly unique by comparing the length of the dataframe with the number of unique ride id's:
# Checking if all ride id's are unique
df.shape[0] == df.ride_id.nunique()
True
All good, now we can proceed with analysis.
To answer the business question of how casual riders and members differ in their use of Cyclistic, we will visualize their differences across months, weekdays, hours and ride durations.
Let's first look at the difference between members and casual riders in their use of the service by month.
# Creating a pivot table for visualization
df_pivot_months = (
pd.pivot_table(
# Grouping by month and membership status
df,
index=['month', 'is_member'],
values='ride_id',
# Counting the number of rides
aggfunc={
'ride_id': 'count'
}).reset_index().rename({'ride_id': 'count'}, axis=1)
# Replacing boolean values for visualization
# purposes
.replace({
False: 'casual riders',
True: 'members'
}))
df_pivot_months['month'] = \
df_pivot_months['month'].replace(
# Replacing integers with months for better visualization
{1:'Jan',
2:'Feb',
3:'Mar',
4:'Apr',
5:'May',
6:'Jun',
7:'Jul',
8:'Aug',
9:'Sep',
10:'Oct',
11:'Nov',
12:'Dec'})
# Validating the pivot table
df_pivot_months.head(5)
| | month | is_member | count |
|---|---|---|---|
| 0 | Jan | casual riders | 7592 |
| 1 | Jan | members | 135057 |
| 2 | Feb | casual riders | 12110 |
| 3 | Feb | members | 125541 |
| 4 | Mar | casual riders | 24335 |
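One caveat to keep in mind: if the month labels ever come out of order on the axis, an ordered Categorical restores the calendar order. A sketch with a few labels:

```python
import pandas as pd

# A few out-of-order month labels (toy data)
order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
         'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
months = pd.Series(pd.Categorical(['Mar', 'Dec', 'Jan'],
                                  categories=order, ordered=True))
print(months.sort_values().tolist())  # ['Jan', 'Mar', 'Dec']
```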
# Plotting differences across months
# with the plotly express library
fig = px.bar(
df_pivot_months,
x='month',
y='count',
color='is_member',
# Changing labels for clarity
labels={
'count': 'Number of rides',
'month': 'Month',
'is_member': ''
},
# Unstacking member and
# casual rider bars
barmode='group',
# Formatting the count number output
text_auto='.3s',
# Setting the title
title="Number of rides per month")
# Adjusting count number traces
fig.update_traces(textfont_size=13, textposition="outside")
# Adjusting the grid
fig.update_yaxes(gridwidth=2)
fig.show()
The bar chart shows a clear difference in the way members and non-members use bike sharing across months. Specifically, it shows that while casual riders largely stop using the service in winter — presumably because of lower temperatures in Chicago — members still do, albeit in considerably lower numbers.
To quantify the staggering difference: the number of member rides in the least active month, March, is around 3.5 times lower than in the most popular month, August. For non-members that ratio (August / January) is around 24.5. The seasonal dropoff in use is thus roughly 7 times greater for casual riders than for members.
The relative consistency in the use of the service by members does suggest that they use it for regular transportation purposes.
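The ratio arithmetic above can be sketched with round hypothetical counts (the real figures come from the df_pivot_months table built above):

```python
import pandas as pd

# Hypothetical peak/trough monthly ride counts; the real numbers live
# in df_pivot_months built above
counts = pd.DataFrame({
    'is_member': ['members', 'members', 'casual riders', 'casual riders'],
    'count': [350_000, 100_000, 245_000, 10_000],
})
grouped = counts.groupby('is_member')['count']
dropoff = (grouped.max() / grouped.min()).round(1)
print(dropoff.to_dict())  # {'casual riders': 24.5, 'members': 3.5}
```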
Now for the difference in use across weekdays.
# Creating a pivot table for visualization
df_pivot_weekdays = (
pd.pivot_table(
df,
# Grouping by weekday and membership status
index=['weekday', 'is_member'],
values='ride_id',
# Counting the number of rides
aggfunc='count').reset_index()
# Replacing boolean values for visualization
# purposes
.replace({
False: 'casual riders',
True: 'members'
}).rename({'ride_id': 'count'}, axis=1))
df_pivot_weekdays = df_pivot_weekdays.replace({
# Replacing integers with weekdays for better visualization
0: 'Mon',
1: 'Tue',
2: 'Wed',
3: 'Thu',
4: 'Fri',
5: 'Sat',
6: 'Sun'
})
# Validating the pivot table
df_pivot_weekdays.head(5)
| | weekday | is_member | count |
|---|---|---|---|
| 0 | Mon | casual riders | 102880 |
| 1 | Mon | members | 471519 |
| 2 | Tue | casual riders | 90037 |
| 3 | Tue | members | 507748 |
| 4 | Wed | casual riders | 91978 |
# Plotting differences across weekdays
fig = px.bar(
df_pivot_weekdays,
x='weekday',
y='count',
color='is_member',
# Changing labels for clarity
labels={
'count': 'Number of rides',
'weekday': 'Weekday',
'is_member': ''
},
# Unstacking member and
# casual rider bars
barmode='group',
# Formatting the count number output
text_auto='.3s',
title="Number of rides per weekday")
# Adjusting count number traces
fig.update_traces(textfont_size=13, textposition="outside")
# Adjusting the grid
fig.update_yaxes(gridwidth=2)
fig.show()
Interestingly, members tend to use Cyclistic on workdays, whereas casual riders prefer to ride on weekends. This fits the hypothesis: member use of the service falls off on weekends because most people don't commute to work then.
Let's look at the differences in member and non-member use of bikes across hours. It makes sense to plot these differences twice — once for rides on workdays and once for rides on weekends.
# Creating a pivot table for visualization
df_pivot_hours_workdays = (
pd.pivot_table(
# Choosing only weekdays from 0 to 4 (Mon-Fri)
df[df['weekday'] <= 4],
# Grouping by hour and membership status
index=['hour', 'is_member'],
values=['ride_id', 'duration_min'],
# Counting the number of rides
# as well as finding the median duration
aggfunc=({
'ride_id': 'count',
'duration_min': 'median'
})).reset_index()
# Replacing boolean values for visualization
# purposes
.replace({
False: 'casual riders',
True: 'members'
}).rename({'ride_id': 'count'}, axis=1))
# Validating the pivot table
df_pivot_hours_workdays.head(5)
| | hour | is_member | duration_min | count |
|---|---|---|---|---|
| 0 | 0 | casual riders | 22 | 3955 |
| 1 | 0 | members | 9 | 7927 |
| 2 | 1 | casual riders | 23 | 2364 |
| 3 | 1 | members | 9 | 3969 |
| 4 | 2 | casual riders | 21 | 1324 |
# Plotting differences across hours
fig = px.bar(
df_pivot_hours_workdays,
x='hour',
y='count',
color='is_member',
# Including hover data with ride duration
hover_data=['duration_min'],
# Changing labels for clarity
labels={
'count': 'Number of rides',
'hour': 'Hour',
'is_member': '',
'duration_min': 'Median duration'
},
# Unstacking member and
# casual rider bars
barmode='group',
# Adjusting the title
title='Number of rides per hour on workdays')
# Showing all of the x axis labels
fig.layout.xaxis.dtick = 1
# Adjusting the grid
fig.update_yaxes(gridwidth=2)
fig.show()
It is now perfectly clear that member rides spike during commute times on workdays. In the year analyzed, more than 350k member rides started at 5 pm (hour 17), as opposed to, for example, around 90k that started around midday.
Rides by casual riders, on the other hand, are more evenly distributed across the day, with the same, albeit much less pronounced, peak around 5 pm.
Furthermore, the hover data on the plot shows that for members the median ride duration barely fluctuates across hours, staying at around 10 minutes. For casual riders it does fluctuate, peaking at 30 minutes during the daytime and dropping to about 20 minutes in the evening.
Let's now build the same kind of bar chart, but filter to only include weekends.
# Creating a pivot table for visualization
df_pivot_hours_weekends = (
pd.pivot_table(
# Choosing only weekdays 5 and 6 (Sat, Sun)
df[df['weekday'] >= 5],
# Grouping by hour and membership status
index=['hour', 'is_member'],
values='ride_id',
# Counting the number of rides
aggfunc='count').reset_index()
# Replacing boolean values for visualization
# purposes
.replace({
False: 'casual riders',
True: 'members'
}).rename({'ride_id': 'count'}, axis=1))
# Validating the pivot table
df_pivot_hours_weekends.head(5)
| | hour | is_member | count |
|---|---|---|---|
| 0 | 0 | casual riders | 4308 |
| 1 | 0 | members | 7777 |
| 2 | 1 | casual riders | 3056 |
| 3 | 1 | members | 4966 |
| 4 | 2 | casual riders | 1980 |
# Plotting differences across hours
fig = px.bar(
df_pivot_hours_weekends,
x='hour',
y='count',
color='is_member',
# Changing labels for clarity
labels={
'count': 'Number of rides',
'hour': 'Hour',
'is_member': ''
},
# Unstacking member and
# casual rider bars
barmode='group',
title="Number of rides per hour on weekends")
# Showing all of the x axis labels
fig.layout.xaxis.dtick = 1
# Adjusting the grid
fig.update_yaxes(gridwidth=2)
fig.show()
Interestingly, on weekends the frequency distributions for casual riders and members are almost identical in shape, with both kinds of users preferring daytime rides.
The last difference between casual riders and members we'll look at is in average and median ride durations.
# Plotting differences in ride durations
# using seaborn and not plotly due to
# plotly crashing when processing the data
fig = sns.violinplot(data=df,
x=df['is_member']\
# replacing boolean values for visualization
# purposes
.replace({False:'casual riders',
True:'members'}),
y='duration_min',
# standardizing the visual style
saturation=0.9)
fig.set_title('Ride duration by membership status')
fig.set_xlabel('')
fig.set_ylabel('Duration in minutes');
Not the most helpful plot: extreme values (ride durations over two hours) dominate the figure. Let's limit the y axis and plot it again, and also print the median and mean values for members and casual riders.
# Plotting differences in ride durations
fig = sns.violinplot(data=df,
x=df['is_member']\
# replacing boolean values for visualization
# purposes
.replace({False:'casual riders',
True:'members'}),
y='duration_min',
# standardizing the visual style
saturation=0.9)
fig.set_title('Ride duration by membership status')
fig.set_xlabel('')
fig.set_ylabel('Duration in minutes')
# limiting the y axis
plt.ylim(0, 100)
# Embedding calculations for means and medians inside f-strings
print(f'''The median and mean durations of rides for members are \
{round(df.loc[df['is_member'] == 1, 'duration_min'].median())}min and \
{round(df.loc[df['is_member'] == 1, 'duration_min'].mean())}min.
For non-members the statistics are \
{round(df.loc[df['is_member'] == 0, 'duration_min'].median())}min and \
{round(df.loc[df['is_member'] == 0, 'duration_min'].mean())}min, \
respectively.''')
The median and mean durations of rides for members are 10min and 13min. For non-members the statistics are 26min and 37min, respectively.
It turns out that casual riders are more likely to take longer rides, whereas, judging by the box plot inside the members' violin, more than 80% of member rides are under 20 minutes.
A reasonable hypothesis is that members treat Cyclistic as a good way to get from point A to point B, whereas casual riders tend to like the recreational aspect of a nice leisurely bike ride.
To recap the whole process: first, we standardized the four separate datasets corresponding to the four most recent full quarters.
Then, we validated the quality of the data, dealt with missing values, transformed data types and created additional columns for further analysis.
Finally, we visualized the differences between members and casual riders in their use of Cyclistic across months, weekdays, and hours in user-friendly interactive bar plots, and used violin plots to visualize the difference in ride durations between the two groups.
Our analysis shows that members tend to use Cyclistic as a means of transportation and commuting, whereas casual riders use the service as a leisurely activity.
Building on this insight, here are three recommendations for a marketing campaign to convert casual riders into members: